## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The distribution of fixed.acidity looks roughly normal, with some outliers.
The distribution of volatile.acidity has a similar shape to that of fixed.acidity, with a different mean and standard deviation.
The distribution of citric.acid looks like the two above, but with a strange peak around 0.5.
The distribution of residual.sugar is highly right-skewed (a long tail of large values), so I transformed it using scale_x_log10, and the new distribution appears bimodal.
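As a rough check of the skew and of what the log transform does, the sketch below uses only the ten residual.sugar values printed in the str() output above, so it illustrates the idea rather than reproducing the full histogram. The variable names are illustrative.

```r
# First ten residual.sugar values, copied from the str() output above.
sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# A mean above the median suggests a long right tail.
mean(sugar) > median(sugar)

# log10() compresses the large values; scale_x_log10 in ggplot2 applies
# the same transformation to the x axis.
log_sugar <- log10(sugar)
range(sugar)
range(log_sugar)
```

On the full column, a histogram of the log10 values is where the two modes would show up.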
The distribution of chlorides looks roughly normal below 0.1 and has many extreme outliers.
The distribution of free.sulfur.dioxide looks much like the one above: it is roughly normal below 100 and has some extreme outliers.
No surprise: the distribution of total.sulfur.dioxide looks like the two above.
The variation in density is very small, but a close look at the range between 0.99 and 1.005 shows a distribution that is closer to uniform.
The distribution of pH is a beautiful normal distribution without any extreme outliers!
The distribution of sulphates is basically a beautiful normal distribution, too.
The distribution of alcohol is slightly right-skewed, without extreme outliers.
quality behaves more like a categorical variable, with only 7 distinct values, so I made a new column called quality_level that labels a quality of 9 as ‘A’, 8 as ‘B’, and so on down to ‘G’ for a quality of 3.
##
## G F E D C B A
## 20 163 1457 2198 880 175 5
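One way to build such a column is with factor(). This is a minimal sketch on toy scores, not the original chunk; in the real data the input would be the quality column of the data frame.

```r
# Toy quality scores standing in for the real quality column.
quality <- c(3, 4, 5, 6, 7, 8, 9, 6, 6, 5)

# Map 3..9 to G..A. Listing the levels from worst to best makes table()
# print them in the same G-to-A order as the output above.
quality_level <- factor(quality, levels = 3:9,
                        labels = c("G", "F", "E", "D", "C", "B", "A"))
table(quality_level)
```

Keeping the levels in worst-to-best order also means that plots grouped by quality_level are drawn in a sensible order.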
There is no categorical variable in the original data set. Most of the numerical variables have outliers that push the maximum far above the mean. The exceptions are density and pH, whose values vary very little, and alcohol and quality, whose maximum does not lie far from the mean.
The main feature of interest in the data set is quality. I will explore its correlation with the other features and try to find which combination of features affects quality the most.
I think alcohol will support my investigation the most, since intuitively it is what affects a wine the most.
Since every other variable may affect density, it is another feature that I want to take a closer look at.
I created quality_level from quality, changing it from a numerical value to a categorical one. I think it will be helpful when I plot relationships between other variables and use quality as the classification variable.
I log-transformed residual.sugar, and the transformed distribution looks bimodal, with neither peak clearly dominant.
I cut off the outliers for most of the variables, and most of them look normally distributed afterwards. pH, sulphates and quality are the only three variables that needed no further manipulation.
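The exact cut-off used for each variable is not stated here, so the following is only one possible way to trim the upper tail: drop everything above a chosen quantile. The helper name `trim_upper` and the 99% threshold are assumptions made for illustration.

```r
# Hypothetical helper: keep only values at or below the p-th quantile.
trim_upper <- function(x, p = 0.99) {
  x[x <= quantile(x, p, na.rm = TRUE)]
}

# Toy vector with one extreme value, standing in for a column such as chlorides.
x <- c(1:99, 1000)
trimmed <- trim_upper(x)
length(trimmed)  # number of values kept
max(trimmed)     # largest value kept
```

Trimming only changes what is plotted; the summaries and correlations in this report are computed on the full data.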
citric.acid has two unusual peaks, around 0.5 and 0.75.
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25581431 0.002857966
## fixed.acidity -0.255814305 1.00000000 -0.022697290
## volatile.acidity 0.002857966 -0.02269729 1.000000000
## citric.acid -0.149899918 0.28918070 -0.149471811
## residual.sugar 0.006623775 0.08902070 0.064286060
## chlorides -0.045645192 0.02308564 0.070511571
## free.sulfur.dioxide -0.011928911 -0.04939586 -0.097011939
## total.sulfur.dioxide -0.161979037 0.09106976 0.089260504
## density -0.185976097 0.26533101 0.027113845
## pH -0.115774132 -0.42585829 -0.031915368
## sulphates 0.009807759 -0.01714299 -0.035728147
## alcohol 0.213656245 -0.12088112 0.067717943
## quality 0.035763247 -0.11366283 -0.194722969
## citric.acid residual.sugar chlorides
## X -0.149899918 0.006623775 -0.04564519
## fixed.acidity 0.289180698 0.089020701 0.02308564
## volatile.acidity -0.149471811 0.064286060 0.07051157
## citric.acid 1.000000000 0.094211624 0.11436445
## residual.sugar 0.094211624 1.000000000 0.08868454
## chlorides 0.114364448 0.088684536 1.00000000
## free.sulfur.dioxide 0.094077221 0.299098354 0.10139235
## total.sulfur.dioxide 0.121130798 0.401439311 0.19891030
## density 0.149502571 0.838966455 0.25721132
## pH -0.163748211 -0.194133454 -0.09043946
## sulphates 0.062330940 -0.026664366 0.01676288
## alcohol -0.075728730 -0.450631222 -0.36018871
## quality -0.009209091 -0.097576829 -0.20993441
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.0119289106 -0.161979037 -0.18597610
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## volatile.acidity -0.0970119393 0.089260504 0.02711385
## citric.acid 0.0940772210 0.121130798 0.14950257
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## sulphates 0.0592172458 0.134562367 0.07449315
## alcohol -0.2501039415 -0.448892102 -0.78013762
## quality 0.0081580671 -0.174737218 -0.30712331
## pH sulphates alcohol quality
## X -0.1157741316 0.009807759 0.21365624 0.035763247
## fixed.acidity -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity -0.0319153683 -0.035728147 0.06771794 -0.194722969
## citric.acid -0.1637482114 0.062330940 -0.07572873 -0.009209091
## residual.sugar -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides -0.0904394560 0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide -0.0006177961 0.059217246 -0.25010394 0.008158067
## total.sulfur.dioxide 0.0023209718 0.134562367 -0.44889210 -0.174737218
## density -0.0935914935 0.074493149 -0.78013762 -0.307123313
## pH 1.0000000000 0.155951497 0.12143210 0.099427246
## sulphates 0.1559514973 1.000000000 -0.01743277 0.053677877
## alcohol 0.1214320987 -0.017432772 1.00000000 0.435574715
## quality 0.0994272457 0.053677877 0.43557472 1.000000000
Most of the variables are not highly correlated with each other.
total.sulfur.dioxide and free.sulfur.dioxide have a correlation coefficient of 0.62, which is expected, since free sulfur dioxide is part of the total.
residual.sugar has a high positive correlation (0.84) with density, and alcohol has a high negative correlation (-0.78) with density.
Not surprisingly, alcohol has the highest correlation coefficient with quality (0.44).
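To make the ranking explicit, the quality column of the matrix above can be sorted by absolute value. The sketch below copies those values by hand (rounded to four decimals); with the data frame loaded, the same vector would come from the "quality" column of the cor() result.

```r
# Correlations with quality, copied from the matrix above (rounded).
quality_cor <- c(
  fixed.acidity        = -0.1137,
  volatile.acidity     = -0.1947,
  citric.acid          = -0.0092,
  residual.sugar       = -0.0976,
  chlorides            = -0.2099,
  free.sulfur.dioxide  =  0.0082,
  total.sulfur.dioxide = -0.1747,
  density              = -0.3071,
  pH                   =  0.0994,
  sulphates            =  0.0537,
  alcohol              =  0.4356
)

# Rank by strength of the relationship, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
head(ranked, 3)
```

The top three come out as alcohol, density and chlorides, which is the order in which they are examined below.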
I want to have a closer look at the relationship between quality and the other variables, starting with alcohol.
At first there doesn’t appear to be any correlation between them. Since quality has only seven possible values, the points may be overplotted, so I use jitter to make a clearer plot.
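The effect of jittering can be sketched in base R with jitter(); in ggplot2 the equivalent would be geom_jitter. The data below are synthetic stand-ins, not the wine data, and the noise amount of 0.3 is an arbitrary choice.

```r
set.seed(1)  # make the random jitter reproducible

# Synthetic stand-ins for the real columns (not the wine data).
quality <- sample(3:9, 200, replace = TRUE)
alcohol <- runif(200, min = 8, max = 14)

# jitter() adds uniform noise of at most `amount` in either direction, so
# points that share a quality score no longer sit on one horizontal line.
quality_jittered <- jitter(quality, amount = 0.3)

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(alcohol, quality_jittered, pch = 16, col = rgb(0, 0, 0, 0.3),
       xlab = "alcohol", ylab = "quality (jittered)")
}
```

Semi-transparent points help as well, because dense regions then appear darker than sparse ones.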
Now there is a slight positive relationship between these two variables, but it is still not strong. This agrees with the correlation coefficient of 0.44 that I got before.
The variable with the next highest absolute correlation with quality is density.
The plot shows a blurred trend, which agrees with the correlation I got before.
Since alcohol and density have a strong negative correlation (-0.78) with each other, and both are related to quality, they may be collinear variables that I need to deal with when investigating what affects quality, so I want to look at the scatter plot of the two.
Although residual.sugar has little relationship with quality, it has a strong relationship with density; in fact, it is the strongest correlation (0.84) in the whole dataset.
There are only 5 records with quality_level ‘A’ (best) and 20 records with quality_level ‘G’ (worst). The plot shows that most of the level ‘A’ wines have alcohol above 12 and most of the level ‘G’ wines are below 12. Most wines have alcohol around 10, and only a few of those have a quality above ‘C’.
level A:
## [1] 10.4 12.4 12.5 12.7 12.9
level G:
## [1] 9.8 11.7 8.5 11.5 12.6 9.6 9.1 12.4 11.0 9.1 9.4 11.0 9.7 10.4
## [15] 10.1 8.0 11.0 11.0 10.5 10.5
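The claim about the 12% threshold can be checked directly from the two vectors printed above. The variable names here are illustrative.

```r
# Alcohol values for the best and worst levels, copied from the output above.
level_a <- c(10.4, 12.4, 12.5, 12.7, 12.9)
level_g <- c(9.8, 11.7, 8.5, 11.5, 12.6, 9.6, 9.1, 12.4, 11.0, 9.1,
             9.4, 11.0, 9.7, 10.4, 10.1, 8.0, 11.0, 11.0, 10.5, 10.5)

# Share of wines in each level with alcohol above 12.
mean(level_a > 12)  # 4 of 5
mean(level_g > 12)  # 2 of 20

# Medians give a second view that is less sensitive to single wines.
median(level_a)
median(level_g)
```

With only 5 and 20 records, these shares are suggestive rather than conclusive.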
The variable with the next highest correlation with quality is chlorides.
Most of the chlorides data lies under 0.1. I looked at both the part under 0.1 and the outliers above 0.1, but neither holds much of a surprise.
The two sulfur-related variables are strongly correlated with each other, but neither seems to have much of a relationship with quality.
It’s interesting that alcohol has a negative correlation with total.sulfur.dioxide (-0.45) and residual.sugar (-0.45), and that there is a positive correlation (0.40) between total.sulfur.dioxide and residual.sugar.
It seems that no matter how high the alcohol is, there are always white wines with low residual.sugar. However, as alcohol goes up, the chance of finding a white wine with high residual.sugar goes down.
Quality has no strong relationship with any variable; only alcohol and density have some relationship with it, and not an obvious one. However, alcohol and density are strongly correlated, so maybe only one of them, which I think is alcohol, has a real relationship with quality. The other variables don’t affect quality much.
There are too few examples of the best and worst levels, only 5 and 20 respectively. More data may be needed to work out the best formula for a level ‘A’ white wine. If I color the points by level, I can only make out patterns in the dots for levels ‘B’ to ‘E’.
The density is more dispersed when residual.sugar and alcohol are lower, but the variance is very small in any case. Unsurprisingly, the residual.sugar outlier (65.8) belongs to the same record as the density outlier (1.03898).
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2782 2782 7.8 0.965 0.6 65.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 2782 0.074 8 160 1.03898 3.39
## sulphates alcohol quality quality_level
## 2782 0.69 11.7 6 D
residual.sugar has some relationship with density, and density has some relationship with quality, but residual.sugar seems to have no relationship with quality. My guess is that although residual.sugar changes density drastically, it is alcohol that changes quality and also changes density. So the relationship between density and quality is due to alcohol rather than residual.sugar.
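One way to probe this guess, which is not part of the original analysis, is a first-order partial correlation: the correlation between two variables after the linear effect of a third has been removed. The sketch below uses only the pairwise correlations printed in the matrix above; the function name `partial_cor` is illustrative.

```r
# Pairwise correlations copied from the matrix above.
r_dq <- -0.30712331  # density vs quality
r_aq <-  0.43557472  # alcohol vs quality
r_da <- -0.78013762  # density vs alcohol

# First-order partial correlation of x and y, controlling for z.
partial_cor <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Density vs quality with alcohol held fixed, and the reverse.
density_given_alcohol <- partial_cor(r_dq, r_da, r_aq)
alcohol_given_density <- partial_cor(r_aq, r_da, r_dq)
round(c(density_given_alcohol, alcohol_given_density), 2)
```

With these inputs the density and quality correlation shrinks to nearly zero once alcohol is held fixed, while the alcohol and quality correlation stays around 0.3, which is consistent with the guess above. It only accounts for linear effects, so it is supporting evidence rather than proof.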
The alcohol values have some relationship with total.sulfur.dioxide and residual.sugar, but I don’t know the scientific reason for this. Maybe it is caused by the brewing procedure.
The strongest relationship among all the variables is between residual.sugar and density (0.84). The second strongest is the negative relationship between alcohol and density (-0.78).
Since alcohol and density are the two variables with the strongest relationship with quality, I want to see the relationship between these three variables.
More level A to C wines lie in the area of low density and high alcohol, but a high alcohol value tends to imply low density, so I may need to discard one of them if I want to use linear regression to predict quality. I think density should be dropped, since its values have only a tiny variance and alcohol intuitively sounds more directly related to the quality of wine.
It’s interesting that this trend is not monotonic: the level E white wines have the highest average density and the lowest average alcohol. Moreover, I think density is not a good feature for determining the level of a white wine, since there are too few level A samples and lots of outliers among the level B and level C samples. The alcohol feature also has many outliers in level E and a few in level F, but its overall trend is still basically monotonic, which is why it has a 0.44 correlation with quality.
There are two strange peaks in the histogram of citric.acid, and I want to see what is going on in these data. The peak around 0.5 is the most prominent one, so I will start with it.
Using a bar plot, I confirmed that the strange peak is at 0.49, and there seems to be no abnormal quality distribution at this point. citric.acid has no significant relationship with any other feature and almost zero correlation (-0.0092) with quality. Its strongest correlation with another feature is 0.29 (with fixed.acidity), so I look at that next.
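Besides reading the peak off a bar plot, the most frequent value can be found with table(). The values below are synthetic, with a spike at 0.49 planted on purpose, and the variable names are illustrative.

```r
# Synthetic citric.acid-like values with an artificial spike at 0.49.
citric <- c(0.30, 0.32, 0.49, 0.49, 0.49, 0.36, 0.28, 0.49, 0.74, 0.34)

# Count each distinct value and pick the most frequent one.
counts <- table(citric)
peak_value <- as.numeric(names(counts)[which.max(counts)])
peak_value
sum(citric == peak_value)  # number of records at the peak
```

This works here because the measurements are recorded to two decimals, so exact ties are common; for continuous data a binned count would be needed instead.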
Just one outlier in fixed.acidity; nothing else strange happens.
I created a subset containing only the records with citric.acid equal to 0.49 (there are 215 of them) and compared its summary and correlations with those of the original dataset.
Summary of each dataset (first: original; second: subset with citric.acid equal to 0.49):
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality quality_level
## Min. : 8.00 Min. :3.000 G: 20
## 1st Qu.: 9.50 1st Qu.:5.000 F: 163
## Median :10.40 Median :6.000 E:1457
## Mean :10.51 Mean :5.878 D:2198
## 3rd Qu.:11.40 3rd Qu.:6.000 C: 880
## Max. :14.20 Max. :9.000 B: 175
## A: 5
## X fixed.acidity volatile.acidity citric.acid
## Min. : 280 Min. : 5.600 Min. :0.0800 Min. :0.49
## 1st Qu.:1478 1st Qu.: 6.800 1st Qu.:0.2000 1st Qu.:0.49
## Median :1554 Median : 7.400 Median :0.2500 Median :0.49
## Mean :1710 Mean : 7.489 Mean :0.2629 Mean :0.49
## 3rd Qu.:1626 3rd Qu.: 8.000 3rd Qu.:0.3000 3rd Qu.:0.49
## Max. :4679 Max. :14.200 Max. :0.8500 Max. :0.49
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.02700 Min. : 3.00
## 1st Qu.: 1.500 1st Qu.:0.03600 1st Qu.:21.50
## Median : 5.000 Median :0.04400 Median :32.00
## Mean : 5.793 Mean :0.04558 Mean :33.61
## 3rd Qu.: 8.100 3rd Qu.:0.05100 3rd Qu.:45.00
## Max. :23.500 Max. :0.23900 Max. :87.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 18.0 Min. :0.9893 Min. :2.850 Min. :0.2700
## 1st Qu.:113.5 1st Qu.:0.9928 1st Qu.:3.065 1st Qu.:0.3750
## Median :138.0 Median :0.9940 Median :3.140 Median :0.4500
## Mean :141.2 Mean :0.9943 Mean :3.163 Mean :0.4623
## 3rd Qu.:164.0 3rd Qu.:0.9956 3rd Qu.:3.240 3rd Qu.:0.5300
## Max. :247.0 Max. :1.0024 Max. :3.650 Max. :0.9800
##
## alcohol quality quality_level
## Min. : 8.50 Min. :4.000 G: 0
## 1st Qu.: 9.70 1st Qu.:5.000 F: 9
## Median :10.50 Median :6.000 E: 54
## Mean :10.48 Mean :5.893 D:110
## 3rd Qu.:11.20 3rd Qu.:6.000 C: 36
## Max. :13.00 Max. :9.000 B: 5
## A: 1
Difference in correlation (correlation of each pair of variables in the subset minus that in the original dataset). The citric.acid entries are NA because the variable is constant in the subset:
## X fixed.acidity volatile.acidity
## X 0.00000000 0.169338319 0.081109140
## fixed.acidity 0.16933832 0.000000000 0.007850737
## volatile.acidity 0.08110914 0.007850737 0.000000000
## citric.acid NA NA NA
## residual.sugar 0.05076183 -0.102222907 0.198221706
## chlorides 0.04882760 -0.126606540 -0.007668191
## free.sulfur.dioxide 0.05408238 -0.169224582 0.211615334
## total.sulfur.dioxide 0.08679813 -0.269339593 0.129207253
## density 0.20116234 -0.187003123 0.129208969
## pH -0.07145433 0.073941212 -0.064242041
## sulphates 0.28740524 -0.136438565 -0.042287170
## alcohol -0.25512162 0.233466205 -0.002368664
## quality -0.05382213 0.014865896 0.070077310
## citric.acid residual.sugar chlorides
## X NA 0.05076183 0.048827596
## fixed.acidity NA -0.10222291 -0.126606540
## volatile.acidity NA 0.19822171 -0.007668191
## citric.acid 0 NA NA
## residual.sugar NA 0.00000000 -0.016974896
## chlorides NA -0.01697490 0.000000000
## free.sulfur.dioxide NA 0.20464880 -0.034777585
## total.sulfur.dioxide NA 0.11837207 -0.006519063
## density NA -0.01579581 -0.034730974
## pH NA 0.12975861 0.136028548
## sulphates NA -0.04865309 0.120901945
## alcohol NA 0.13007051 0.108414353
## quality NA 0.09646195 0.009786642
## free.sulfur.dioxide total.sulfur.dioxide density
## X 0.05408238 0.086798132 0.20116234
## fixed.acidity -0.16922458 -0.269339593 -0.18700312
## volatile.acidity 0.21161533 0.129207253 0.12920897
## citric.acid NA NA NA
## residual.sugar 0.20464880 0.118372067 -0.01579581
## chlorides -0.03477759 -0.006519063 -0.03473097
## free.sulfur.dioxide 0.00000000 0.125635140 0.14705069
## total.sulfur.dioxide 0.12563514 0.000000000 0.02359608
## density 0.14705069 0.023596084 0.00000000
## pH 0.09835056 0.122556755 0.17865955
## sulphates -0.08556566 -0.126602800 0.01771059
## alcohol 0.02584678 0.124724373 0.09505501
## quality 0.06932216 0.095331822 0.02175604
## pH sulphates alcohol quality
## X -0.071454330 0.28740524 -0.255121617 -0.053822126
## fixed.acidity 0.073941212 -0.13643856 0.233466205 0.014865896
## volatile.acidity -0.064242041 -0.04228717 -0.002368664 0.070077310
## citric.acid NA NA NA NA
## residual.sugar 0.129758611 -0.04865309 0.130070509 0.096461950
## chlorides 0.136028548 0.12090194 0.108414353 0.009786642
## free.sulfur.dioxide 0.098350558 -0.08556566 0.025846779 0.069322164
## total.sulfur.dioxide 0.122556755 -0.12660280 0.124724373 0.095331822
## density 0.178659548 0.01771059 0.095055007 0.021756044
## pH 0.000000000 -0.06671252 -0.113451722 -0.001141212
## sulphates -0.066712519 0.00000000 -0.124292817 -0.068626260
## alcohol -0.113451722 -0.12429282 0.000000000 0.044543108
## quality -0.001141212 -0.06862626 0.044543108 0.000000000
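A comparison like the one above can be sketched as follows. The data frame here is synthetic (made-up values with three columns only), and the names `wine` and `at_peak` are assumptions; the point is the subset-then-subtract pattern and where the NA values come from.

```r
set.seed(42)  # reproducible synthetic data

# Synthetic stand-in for the wine data frame (values are made up).
wine <- data.frame(
  citric.acid = sample(c(0.30, 0.36, 0.49), 300, replace = TRUE),
  alcohol     = runif(300, 8, 14),
  density     = runif(300, 0.987, 1.003)
)

# Keep only the records at the peak.
at_peak <- subset(wine, citric.acid == 0.49)

# cor() warns that the standard deviation is zero for the constant column,
# and returns NA for every pair that involves it.
cor_diff <- suppressWarnings(cor(at_peak)) - cor(wine)
cor_diff
```

Differences close to zero mean the subset behaves like the full data for that pair of variables.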
I still can’t find any significant difference in the data shown above. The mean, median and correlations with other variables don’t change in any obvious way in the subset. So I think the peak may have happened by chance, without any deeper reason.
The only feature that seems to have some effect on pH is fixed.acidity, but neither of them affects quality much.
As I mentioned above, residual.sugar has a strong relationship with density, and density has some relationship with quality, but residual.sugar seems to have no relationship with quality.
I tried applying a log10 transformation to residual.sugar in the last plot, since its distribution is highly right-skewed, but the plot didn’t show much more insight than the untransformed one.
I took a closer look at the variables that have some relationship with quality. I got a better idea of how the values are distributed, but didn’t get a better formula for predicting quality.
The surprising finding is that the boxplot shows the relationship between alcohol and quality_level is not monotonic if we skip the values that seem to be outliers in each level.
I also looked at another interesting feature, the strange peak in citric.acid, even though it seems to have nothing to do with my feature of interest, quality. However, I couldn’t find any significant difference between the records with citric.acid of 0.49 and the others, so I think it just happened by chance.
I tried to find more relationships between quality and other features, but I couldn’t find any other feature with a strong relationship with quality, even after log or power transformations. The only two features that quality is noticeably related to, alcohol (0.44) and density (-0.31), are correlated with each other, so I need to choose one of them to use. In the end, the best model I could get has an R-squared value of around 0.2, so I don’t have a great model to show.
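The R-squared of around 0.2 is roughly what the correlations alone predict. The sketch below assumes an ordinary least-squares linear model, which may not be the exact model that was fitted: with one predictor, R-squared equals the squared correlation, and with two predictors it can be computed from the three pairwise correlations printed earlier.

```r
# Pairwise correlations copied from the correlation matrix.
r_aq <-  0.43557472  # alcohol vs quality
r_dq <- -0.30712331  # density vs quality
r_ad <- -0.78013762  # alcohol vs density

# One predictor: R-squared is the squared correlation.
r2_alcohol <- r_aq^2

# Two predictors: R-squared from the three pairwise correlations.
r2_both <- (r_aq^2 + r_dq^2 - 2 * r_aq * r_dq * r_ad) / (1 - r_ad^2)

round(c(alcohol_only = r2_alcohol, alcohol_and_density = r2_both), 3)
```

Adding density to alcohol raises R-squared by well under 0.01, which supports keeping just one of the two.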
Alcohol is the feature that affects white wine quality the most. This plot shows that, in general, the better the quality of the wine, the higher its alcohol percentage.
residual.sugar and density have the strongest correlation (0.84) in the whole dataset, and the plot shows this relationship clearly. Another interesting thing that can be seen here is that although density has some relationship with quality (horizontal color difference), residual.sugar seems to have no relationship with quality (vertical color difference).
This plot takes a closer look at the two features with the highest correlation with quality, alcohol (0.44) and density (-0.31). The high-quality records are concentrated in the upper-left corner. Meanwhile, there is a clear linear regression line, which indicates that the correlation between density and alcohol is high. This means they are collinear variables, and maybe only one of them has a real relationship with quality.
In this analysis project, I started by looking at the distribution of each feature and found that most of them are close to normal after cutting off the outliers. With a basic idea of each feature, I looked at the correlations between all the features and put most of my attention on the correlations between quality and the others. Most of the plots didn’t surprise me much, which is reassuring on the one hand but worrying on the other. It is reassuring that the plots agree with the numerical results, but worrying that I couldn’t find an underlying relationship to use in the prediction model I wanted to build. It is a little frustrating that I couldn’t get a good prediction model in the end, but I think this is not unusual in the real world.
If I want to explore further, I should learn more about wine and chemistry, which may help me figure out how to manipulate the data to uncover underlying relationships. Another thing I can try in the future is a wider range of learning models; maybe an SVM with a suitable kernel function can pick up patterns for predicting quality that I just didn’t see here. Finally, finding more data may be the most straightforward way to improve my model, but that may not be easy to do.
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib